Defining Optimality in Statistical Inference
MATH003 Lesson 8
00:00
In the vast wilderness of statistical data, we are hunters seeking the truth—the true parameter $\psi(\theta)$. But how do we decide which arrow (estimator) is best? Optimality is not a vague feeling; it is the mathematical art of minimizing loss. To find the 'best' estimator, we look to the Mean Squared Error (MSE), which elegantly decomposes into the tension between two fundamental forces: Variance and Bias.

Defining the Gold Standard: MSE

To quantify how far our guess $T$ is from the reality $\psi(\theta)$, we define the Mean Squared Error (Definition 6.3.1):

$$MSE_\theta(T) = E_\theta((T - \psi(\theta))^2)$$

This is the average squared distance between our estimator and the target. A perfect estimator would have an MSE of zero, but in a world of random noise, we strive to minimize it.
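The definition can be made concrete with a small Monte Carlo sketch. Here we estimate the MSE of the sample mean for a normal sample; the model (normal with unit variance), sample size, and replication count are illustrative assumptions, not part of the lesson.

```python
import random
import statistics

# Monte Carlo estimate of MSE_theta(T) = E_theta((T - psi(theta))^2)
# for the sample mean T = xbar targeting psi(theta) = theta.
# Model and constants below are hypothetical choices for illustration.
random.seed(0)
theta = 2.0          # true parameter (the target psi(theta))
n, reps = 10, 5000   # sample size and simulation replications

errors = []
for _ in range(reps):
    sample = [random.gauss(theta, 1.0) for _ in range(n)]
    T = statistics.mean(sample)        # the estimator
    errors.append((T - theta) ** 2)    # squared distance to the target

mse_hat = statistics.mean(errors)
print(round(mse_hat, 3))
```

For an unbiased sample mean of $n$ unit-variance observations the theoretical MSE is $1/n = 0.1$, and the simulated value lands close to it.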

Theorem 8.1.1: The Architecture of Error

Why does an estimator fail? Theorem 8.1.1 provides the blueprint. If $T$ has a finite second moment, the error relative to any constant $c$ is given by:

$$E((T - c)^2) = \text{Var}(T) + (E(T) - c)^2$$

This formula reveals that the total squared error is minimized only when we choose $c = E(T)$. In the context of inference, we set $c = \psi(\theta)$, leading to the famous decomposition:

$$MSE_\theta(T) = \text{Var}_\theta(T) + (E_\theta(T) - \psi(\theta))^2 = \text{Variance} + \text{Bias}^2$$

where the bias $E_\theta(T) - \psi(\theta)$ measures systematic over- or under-estimation.
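Because the theorem is an algebraic identity, it can be checked exactly on a small discrete distribution. The values and probabilities below are hypothetical; the check confirms that $E((T-c)^2) = \text{Var}(T) + (E(T)-c)^2$ for every $c$, and that the error is smallest at $c = E(T)$.

```python
# Exact check of Theorem 8.1.1 on a small discrete distribution
# (the values and probabilities are hypothetical illustrations).
values = [1.0, 2.0, 4.0]
probs  = [0.2, 0.5, 0.3]

def E(f):
    # expectation of f(T) under the discrete law above
    return sum(p * f(v) for v, p in zip(values, probs))

mean = E(lambda t: t)                     # E(T)
var = E(lambda t: (t - mean) ** 2)        # Var(T)

for c in (0.0, 1.5, 3.0, mean):
    lhs = E(lambda t: (t - c) ** 2)       # total squared error about c
    rhs = var + (mean - c) ** 2           # Var + squared deviation of E(T) from c
    assert abs(lhs - rhs) < 1e-12         # the identity holds for every c

print("minimized at c = E(T) =", mean)
```

Setting $c = \psi(\theta)$ turns the identity into the variance-bias decomposition of the MSE.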

The Precision-Accuracy Tradeoff

Imagine two weighing scales in a quality control lab:

  • The Precise Relic: It gives the same weight every time (low Variance) but is miscalibrated by 2 grams (high Bias).
  • The Erratic Sage: It is correct on average (zero Bias) but oscillates wildly between measurements (high Variance).

Theorem 8.1.1 allows us to calculate exactly which scale provides the lower total error. Often, we are willing to accept a small amount of systematic deviation (Bias) if it drastically reduces the noise (Variance).
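The two scales can be compared directly with the decomposition; no simulation is needed. The bias and standard deviation figures below are illustrative assumptions consistent with the story above.

```python
# Comparing the two hypothetical scales via Theorem 8.1.1:
# total error = Variance + Bias^2.
precise_relic = {"bias": 2.0, "sd": 0.1}   # miscalibrated but steady
erratic_sage  = {"bias": 0.0, "sd": 3.0}   # unbiased but noisy

def mse(scale):
    # MSE = Var + Bias^2 (Theorem 8.1.1 with c = psi(theta))
    return scale["sd"] ** 2 + scale["bias"] ** 2

print(round(mse(precise_relic), 2))  # 4.01
print(round(mse(erratic_sage), 2))   # 9.0
```

With these numbers the biased but precise scale wins: its 2-gram systematic error costs less than the unbiased scale's wild oscillation.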

Example 8.1.1: Sufficiency and Information

Optimality is tied to information. Consider a sample space $S = \{1, 2, 3, 4\}$. If outcomes 2, 3, and 4 are equally likely under every possible parameter, they carry the same likelihood. We can therefore define a sufficient statistic $U$ that groups these outcomes together without losing any ability to make an optimal inference: since $L(\cdot|2) = L(\cdot|3) = L(\cdot|4)$, an optimal estimator treats these three outcomes as a single informative event.
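A minimal sketch of this grouping, assuming a hypothetical model in which $P_\theta(1) = \theta$ and the remaining probability $1 - \theta$ is split evenly over 2, 3, and 4 (so those three outcomes are equally likely under every $\theta$):

```python
# Hypothetical model on S = {1, 2, 3, 4}: P(1) = theta,
# P(s) = (1 - theta) / 3 for s in {2, 3, 4}.
def likelihood(theta, s):
    return theta if s == 1 else (1.0 - theta) / 3.0

# Sufficient statistic U groups the equally likely outcomes together.
def U(s):
    return 0 if s == 1 else 1

# L(theta | 2) = L(theta | 3) = L(theta | 4) for every theta, so observing
# U(s) = 1 conveys exactly as much about theta as observing s itself.
for theta in (0.1, 0.5, 0.9):
    assert likelihood(theta, 2) == likelihood(theta, 3) == likelihood(theta, 4)

print("likelihood depends on s only through U(s)")
```

Any estimator that distinguished between outcomes 2, 3, and 4 would be reacting to noise rather than information, so an optimal estimator is a function of $U$ alone.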

🎯 Core Principle
An estimator is optimal when it minimizes the expected loss. For squared error loss, this means finding the point where the sum of Variance and Bias² is at its absolute minimum.